Feat/rich oteL lmetrics#975
Merged
Merged
Conversation
9 tasks
a49af1c to
e7eb0ff
Compare
a7cbffc to
7561ac2
Compare
6 tasks
9 tasks
532e289 to
1e482d4
Compare
9ff46bb to
c0069d1
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR wires previously-defined-but-dead OpenTelemetry metrics across the KMS server, spanning database, HTTP, object-count, cache, and HSM paths, and updates docs/tests to validate OTLP export behavior.
Changes:
- Add DB operation count/duration instrumentation via an injected
DbMetricsRecordertrait, implemented byOtelMetrics. - Add Actix middleware for HTTP request counters/duration and active connection tracking, plus path normalization to cap cardinality.
- Implement privileged object/key counting across backends (SQL/Redis/HSM), add Redis O(1) counters + reconcile, update docs/changelog, and extend OTLP export tests.
Reviewed changes
Copilot reviewed 45 out of 46 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| nix/expected-hashes/ui.vendor.non-fips.sha256 | Update pinned UI vendor hash (non-FIPS) for Nix builds. |
| nix/expected-hashes/ui.vendor.fips.sha256 | Update pinned UI vendor hash (FIPS) for Nix builds. |
| nix/expected-hashes/server.vendor.static.sha256 | Update pinned server vendor/static hash for Nix builds. |
| nix/expected-hashes/cli.vendor.linux.sha256 | Update pinned CLI vendor hash for Nix builds. |
| monitoring/OTLP_METRICS.md | Reformat/extend metrics documentation and add resource links. |
| documentation/mkdocs.yml | Add nav entry for new “Metrics reference” page under Monitoring. |
| documentation/docs/configuration/otlp-metrics.md | New metrics reference page detailing instruments, labels, and semantics. |
| documentation/docs/configuration/logging.md | Link logging/telemetry doc to the new metrics reference. |
| crate/server/src/start_kms_server.rs | Install OTEL HTTP metrics middleware at the App level. |
| crate/server/src/middlewares/otel_http_middleware.rs | New Actix middleware for HTTP metrics + path normalization + tests. |
| crate/server/src/middlewares/mod.rs | Export OtelHttpMetrics middleware from module. |
| crate/server/src/cron.rs | Replace Locate-based key counting with privileged DB counts; add reconcile ticker. |
| crate/server/src/core/wrapping/wrap.rs | Record HSM operation metric for wrap path using crypto oracle. |
| crate/server/src/core/wrapping/unwrap.rs | Record HSM operation metric for unwrap path using crypto oracle. |
| crate/server/src/core/uid_utils.rs | Add hsm_model_from_prefix helper + tests; re-export HSM instance params usage. |
| crate/server/src/core/otel_metrics.rs | Implement gauges for object/key counts, add DB recorder impl, add more metric tests, cardinality cap behavior. |
| crate/server/src/core/operations/key_ops/crypto_op.rs | Record HSM operation metric on oracle-routed crypto operations. |
| crate/server/src/core/kms/other_kms_methods.rs | Record cache hit/miss/insert metrics in unwrap-cache path. |
| crate/server/src/core/kms/mod.rs | Inject DB metrics recorder into Database; seed object/key gauges at startup; update otel reader/resource setup for otel 0.29. |
| crate/server/src/config/params/mod.rs | Re-export HsmInstanceParams. |
| crate/server/src/config/mod.rs | Re-export HsmInstanceParams from config module. |
| crate/server/Cargo.toml | Add opentelemetry_sdk testing feature for metric tests. |
| crate/server_database/src/tests/mod.rs | Register additional Redis-findex tests for new counters. |
| crate/server_database/src/stores/sql/sqlite.rs | Implement privileged count queries + add loader and E2E tests for count query presence/correctness. |
| crate/server_database/src/stores/sql/query.sql | Add named count queries for non-destroyed objects and keys (SQLite/PG variants). |
| crate/server_database/src/stores/sql/query_mysql.sql | Add named count queries for non-destroyed objects and keys (MySQL). |
| crate/server_database/src/stores/sql/pgsql.rs | Implement privileged count queries for objects and keys (PostgreSQL). |
| crate/server_database/src/stores/sql/mysql.rs | Implement privileged count queries for objects and keys (MySQL). |
| crate/server_database/src/stores/redis/redis_with_findex.rs | Maintain Redis O(1) live-object + key counters across CRUD/state/atomic; implement count + reconcile paths. |
| crate/server_database/src/stores/redis/objects_db.rs | Add Redis counter keys + adjust/get/set + SCAN bootstrap counting helpers. |
| crate/server_database/src/stores/redis/additional_redis_findex_tests.rs | Add integration tests for Redis live-object and active-key counters + bootstrap behavior. |
| crate/server_database/src/lib.rs | Re-export DbMetricsRecorder. |
| crate/server_database/src/core/unwrapped_cache.rs | Update tests to pass new Database::instantiate signature (recorder arg). |
| crate/server_database/src/core/mod.rs | Add recorder field to Database; add MainDbKind::as_str; thread recorder through instantiation. |
| crate/server_database/src/core/db_metrics.rs | New trait defining the DB→server metrics recording interface. |
| crate/server_database/src/core/database_permissions.rs | Wrap permission DB facade methods with record(...). |
| crate/server_database/src/core/database_objects.rs | Add central record(...) wrapper; wrap object-store facade methods; add privileged count/reconcile facade APIs + recorder test. |
| crate/interfaces/src/stores/objects_store.rs | Extend ObjectsStore with default count/reconcile methods (metrics-only privileged APIs). |
| crate/interfaces/src/hsm/hsm_store.rs | Implement HSM count methods; add mockall-based unit test ensuring count delegation/behavior. |
| crate/interfaces/Cargo.toml | Add dev-deps for mock-based async tests. |
| CHANGELOG/feat_richOTELmetrics.md | Branch changelog documenting metric wiring steps and rationale. |
| Cargo.toml | Bump OpenTelemetry crates to 0.29 workspace-wide. |
| Cargo.lock | Lockfile updates for OTel bump + new dev-dependencies (mockall, etc.). |
| .github/scripts/test/test_otel_export.sh | Extend OTEL export integration script to assert new metric families are non-zero and labels present; add wrapped-key scenario for cache metrics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
7bd470d to
58e3f27
Compare
Manuthor
reviewed
Jun 16, 2026
fix: fix some useless test vector generator
feat: first try with object count
fix: fixes test: others fix: hash fix: update nix vendor hashes fix: ci fix: ci2 fix: ci3 fix: ci4 fix: ciC fix: manifest test data fix: ci_9 fix: blue lock fix: nix feat: huge ci feat: fix
58e3f27 to
0dc7468
Compare
HatemMn
commented
Jun 16, 2026
Manuthor
approved these changes
Jun 16, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
read the issue
How to read this PR
The 8 metrics that are asked to be integrated operate (almost) independently and each one of them is coded somewhere in the codebase.
In this PR, we will try to work on half of them :
Canonical Global Plan
1. Database metrics
commit: read below
Goal: wire database operations total and database operation duration at database entry points.
Output: every database call path emits count and duration with normalized operation naming.
Done when: database metrics are non-zero during normal CRUD and permission flows, with no double counting.
2. Http metrics
commit: read below
HTTP metrics wiring.
Goal: wire HTTP requests total, HTTP request duration, and active connections via middleware.
Output: one middleware layer that increments active connections on request start and decrements on completion, including error paths.
Done when: method/path/status request metrics appear and active connections returns to baseline after load bursts.
3. Object metrics
commit: read below
Objects total metric wiring.
Goal: wire objects total in the periodic metrics refresh job.
We want to wire both active_keys_count_value as well as kms.objects.total. A file on the matter alread exists : crate/server/src/cron.rs. It's wrong for multi admin deployments and shall be deleted
3.bis : fix the active_keys_count by changing/deleting that cron job and reusing the code of the previous step
4. HSM and Cache metrics
commit: read below
Cache and HSM metrics wiring.
Goal: wire cache operations total and hsm operations total.
Output:
Human readable summarry of the technical work done, for revierwers :
all the text below was AI generated but it's actually relevant
e2b28745— DB MetricsMetrics:
kms.db.operations.total,kms.db.operation.durationWhat was done:
A
DbMetricsRecordertrait was introduced inserver_databaseas the only bridge between the database layer and the OTEL layer — keeping the dependency arrow strictly one-directional (server → server_database, never the reverse).OtelMetricsin the server crate implements that trait.The
Databasefacade gained a private genericrecord()helper method that wraps any async DB call: it starts a timer, awaits the future, determines"success"or"error"from the result, then calls the injected recorder with the operation name, backend kind ("sqlite","postgresql","mysql","redis"), outcome, and elapsed wall-clock time. Every public method onDatabase—create,retrieve_object,retrieve_objects,retrieve_tags,update_object,delete,find,atomic,is_object_owned_by,grant_ops,remove_ops, etc. — was wrapped with this helper.Both the counter (
kms.db.operations.total) and the histogram (kms.db.operation.duration) carry the same three labels:operation,backend,outcome.bdc30123— HTTP Metrics MiddlewareMetrics:
kms.http.requests.total,kms.http.request.duration,kms.active.connectionsWhat was done:
A new Actix-web middleware
OtelHttpMetricswas created in otel_http_middleware.rs. It wraps every inbound request as the outermostApp-level layer so it captures total latency including auth and routing.On each request it:
kms.active.connections(anUpDownCounter) immediately before calling the inner service.std::time::Instanttimer.kms.active.connectionsunconditionally in the completion path.kms.http.requests.total(counter) andkms.http.request.duration(histogram) with labelsmethod,path,status.A
normalize_path()function maps the raw URI to a low-cardinality label set (e.g. all KMIP calls become/kmip/2_1, UI assets become/ui/...) to prevent metric cardinality explosion from hashed filenames or per-UID REST paths.When OTLP is not configured,
metricsisNoneand the entire middleware is a zero-overhead pass-through.276c3b97— Objects Total (Step 3, first pass)Metrics:
kms.objects.total,kms.keys.active.countWhat was done:
Two new gauges were added to
OtelMetrics:kms_objects_total(kms.objects.total) andactive_keys_count(kms.keys.active.count). Both are plaini64_gaugeinstruments that record an absolute snapshot — no delta arithmetic.The
ObjectsStoretrait gained acount_all_non_destroyed()default method. At this point the default just returned 0 with awarn!()log so it was safe to deploy without breaking backends that hadn't implemented it yet.The existing
cron.rsmetrics loop was extended to callkms.database.count_all_non_destroyed_objects()every 30 seconds and feed the result intometrics.update_objects_total(count).This commit still carried a
TODOnote on the cron active-keys path because the oldLocate-based approach was known to undercount in multi-admin deployments; that was addressed in step 3-bis.16043a5f— Redis-Findex Backend for Object Count (Step 3-bis, part 1)Metrics:
kms.objects.total,kms.keys.active.countWhat was done:
The Redis-findex backend (redis) received a real implementation of
count_all_non_destroyed. Because Redis has no nativeCOUNT WHERE state != Destroyedquery, a dedicated O(1) counter key was introduced in theObjectsDblayer:adjust_live_count(delta: i64)— increments or decrements the Redis counter atomically alongside every create/delete/destroy operation.get_live_count() / set_live_count()— read and overwrite the counter for reconcile/repair.scan_count_non_destroyed()— full scan fallback that iterates all keys and counts those not inDestroyedstate, used to seed or correct the fast-path counter.The fast-path counter is kept consistent in the hot path; a periodic full-scan reconcile (every 5 minutes, added in the follow-up commit) corrects any drift from partial failures.
Integration tests were added in
additional_redis_findex_tests.rsto exercise create/destroy sequences and verify the count stays accurate.e52832fa— Step 3-bis: Full Objects Count + Old Cron CleanupMetrics:
kms.objects.total,kms.keys.active.countWhat was done:
This is the commit that closed the multi-admin correctness problem noted in step 3.
The old
cron.rslogic that used a KMIPLocatecall filtered toState::Activewas deleted. That approach had an inherent ACL bias: if the cron user didn't own all keys, the count would undercount. It was replaced with a privileged backend call —kms.database.count_non_destroyed_key_objects()— which queries the raw store directly, bypassing KMIP access control.A new 5-minute
reconcile_intervalticker was added tocron.rs. It callskms.database.reconcile_all_object_counts(), which is a no-op for SQL backends (they compute the count on-the-fly from a real SQLCOUNT) but repairs the Redis O(1) counter by runningscan_count_non_destroyed()and overwriting the fast-path key if it has drifted.SQL backends (SQLite, PostgreSQL, MySQL) received real
count_all_non_destroyedimplementations using aSELECT COUNT(*) WHERE state != 'Destroyed'query added toquery.sqlandquery_mysql.sql. TheHsmStoreimplementation was also wired, delegating tocount_non_destroyed_keys()(see next commit).8ebecb41— Step 3-bis: HSM Object Count + TestsMetrics:
kms.objects.total(via HSM object store path)What was done:
The
HsmStore'scount_all_non_destroyed()override changed from returning a hard-coded 0 (with a suppressed warning) to callingself.count_non_destroyed_keys(). The reasoning: on an HSM, deleting a key physically removes it from the device — there is noDestroyedstate — so every key present in a slot is by definition non-destroyed. Therefore delegating to the existing key-count path is semantically correct.A full
mockall-generatedMockHsmtest double was added tohsm_store.rstests, and a new test verified thatcount_all_non_destroyedreturns the same value ascount_non_destroyed_keysfor a mock HSM with two slots (3 + 2 keys = 5 total). This ensures the method is never accidentally hard-coded to 0 again.e7eb0ffc— Cache & HSM Operation Metrics (Step 4)Metrics:
kms.cache.operations.total,kms.hsm.operations.totalWhat was done:
Cache side: Three recording points were added in
other_kms_methods.rsinsideget_unwrapped_object— the single code path that owns all cache interactions. A hit callsrecord_cache_operation("get", "hit"), a miss callsrecord_cache_operation("get", "miss"), and a successful insert callsrecord_cache_operation("insert", "ok"). Labels areoperationandresult.HSM side: Recording was placed at the real crypto-oracle dispatch point in
perform_crypto_operationin crypto_op.rs. When the resolved key routes to aResolvedKey::Oraclebranch (meaning the actual cryptographic operation runs on an HSM via PKCS#11), the code now callsmetrics.record_hsm_operation(Op::OP_NAME, model)after the oracle returns. Themodelstring ("softhsm2","utimaco", etc.) is resolved by a new utility functionhsm_model_from_prefixinuid_utils.rs, which looks up the configuredHsmInstanceParamsby UID prefix and returns the model field — or falls back to the raw prefix if no match is found. Wrap and Unwrap operations inwrap.rs/unwrap.rswere also wired individually with their ownrecord_hsm_operationcalls since they go through a separate code path. Labels areoperationandmodel.Closes #997